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Abstract 

The Erlangian approximation of Markovian fluid queues leads to the 
problem of computing the matrix exponential of a subgenerator having a 
block-triangular, block-Toeplitz structure. To this end, we propose some 
algorithms which exploit the Toeplitz structure and the properties of gen¬ 
erators. Such algorithms allow to compute the exponential of very large 
matrices, which would otherwise be untreatable with standard methods. 

We also prove interesting decay properties of the exponential of a gener¬ 
ator having a block-triangular, block-Toeplitz structure. 

Keyword Matrix exponential, Toeplitz matrix, circulant matrix, Markov 
generator, fluid queue, Erlang approximation. 


1 Introduction 


The problem we consider here is to compute the exponential of an upper block- 
triangular, block-Toeplitz matrix, that is, a matrix of the kind 


T(U) 


U 0 Ur ... U n —\ 

u 0 

Ur 

0 U 0 


( 1 ) 


where i = 0, ...,n — 1, are m x m matrices. Our interest stems from 
the analysis in Dendievel and Latouche TO] of the Erlangization method for 
Markovian fluid models, but the story goes further back in time. 
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1.1 Origin of the problem 

The Erlangian approximation method was introduced in Asmussen et al. [2] 
in the context of risk processes; it was picked up in Stanford et al. [T5] where 
a connection is established with fluid queues. Other relevant references are 
Stanford et al. [H] where the focus is on modelling the spread of forest fires, 
and Ramaswami et al. m where some basic algorithms are developed. 

Markovian fluid models are two-dimensional processes {(X(f), (p(t)) : t £ 
R + } where {^(f)} is a Markov process with infinitesimal generator A on the 
state space {1,..., to}; to each state i is associated a rate of growth q £ R and 
X(t ) is controlled by tp(t) through the equation 

X(t) = X (0) + f c^^ds, for t > 0. 

Jo 

Performance measures of interest include the distributions of X (t) and of various 
first passage times. Usually, ip(t) is called the phase of the process at time t 
and X(t) its level, and the phase space {1 ,...,to} is partitioned into three 
subsets S+, S- and So such that Ci > 0, Ci < 0 or c* = 0 if i is in S+, S- or 
So, respectively. To simplify our presentation without missing any important 
feature, we assume below that So is empty. 

The first return probabilities of X(t) to its initial level V(0) play a central 
role in the analysis of fluid queues. It is customary to define two matrices T 
and T of first return probabilities: 


^ij = Pr[r < oo ,ip(r) = j\X(0) = 0,^(0) = i], i£ S + , j £ S_, 

and 

% = Pl '[ T < °°> V 3 ( T ) = j 1^(0) = 0,^(0) = i], i £ S~, j £ S + , 

where r = inf{f > 0 : X(t) = 0} is the first passage time to level 0. Thus, 
the entries of T and T are the probability of returning to the initial level after 
having started in the upward, and the downward directions, respectively. 

If the process starts from some level x > 0, then 


Pr[r < oo ,<p(t) = j\X(0) = x,ip{0) = i] 




i £ to}, j £ 


here, H is a square matrix on x S- and is given by 
H = |C'_|- 1 A__ + |C_r 1 A_ + ^, 


where A _and A _(_ are submatrices of the generator A , indexed by S- x 

and 6>_ x <S+, respectively, and |C_| is a diagonal matrix with \ci\,i £ on 
the diagonal. A similar equation holds for x < 0. The matrices ’P and 'P are 
solutions of algebraic Riccati equations and their resolution has been the object 
of much attention. Very efficient algorithms are available, and we refer to Bini 
et al. [7] and Bean et al. [3]. 
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The Erlangian approximation method is introduced in [2] to determine the 
detailed distribution of r. The idea is that, to compute the probability 

F(t 0 ; i , x ) = Pr[r < f o |X(0) = x, 0) = i], x > 0, i G {1,..., to} 


for a fixed value to , it is convenient to replace to by a random variable T with an 
Erlang distribution, with parameters (n/to,n) for some positive integer n. The 
random variable T has expectation to and variance to/n, so that F{T\i,x) is a 
good approximation of F(to',i,x ) if n is large enough. From a computational 
point of view, the advantage is that one replaces systems of integro-differential 
equations by linear equations. 

The long and the short of it is that the original system is replaced by the 
process {(X(f), <f>(t))} with a two-dimensional phase <f>(t) = (/3(f), ip(t)) on the 
state space {1 ,n} x {!,..., to} U {0} and with the generator 


'A - vl vl 


Q = 


A-vI 


0 


0 0 


vl 

A-vI vl 

0 0 


where v = n/to- The physical interpretation is that the absorbing state 0 is 
entered at the random time T, and the component /? of the phase marks the 
progress of time towards T. Some authors (for instance [21 ITS] 1 report that good 
approximations may be obtained with small values of n. 

Because of the Toeplitz-like structure of Q 1 the matrices T and H are both 
upper block-triangular block-Toeplitz and it is interesting to use the Toeplitz 
structure in order to reduce the cost when n is large. This is done in m for the 
matrix H/. Here we address the question of efficiently computing the exponential 
matrix e Hx for a given value of x, where H has the structure of |l]). We shall 
assume without loss of generality that x = 1. 


1.2 Main results 

We recall that the exponential function can be extended to a matrix variable 
by defining 



For more details on the matrix exponential and more generally on matrix func¬ 
tions we refer the reader to Higham mi. 

The matrix T{U) defined in is of order nm and it may be huge, since 
a larger n leads to a better Erlangian approximation, while the size m of the 
blocks is generally small. The matrix T(f7) is a subgenerator, i.e., it has negative 
diagonal entries, nonnegative off-diagonal entries, and the sum of the entries on 
each row is nonpositive. 
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Since block-triangular block-Toeplitz matrices are closed under matrix mul¬ 
tiplication, it follows from ([2]) that the matrix exponential tUU) is also an up¬ 
per block triangular, block-Toeplitz matrix; in particular, the diagonal blocks 
of e r( ' U ' ) coincide with e u °. Moreover, it is known that the matrix eTU) is 
nonnegative and substochastic. 

The problem of the computation of the exponential of a generator has been 
considered in Xue and Ye mm and by Shao et al. na. where the authors 
propose component-wise accurate algorithms for the computation. These algo¬ 
rithms are efficient for matrices of small size. For the Erlangian approximation 
problem, these algorithms are useless for the large size of the matrices involved. 
Recently, some attention has been given to the computation of the exponential 
of general Toeplitz matrices by using Arnoldi method (Lee et al. [T^] , Pang and 
Sun [15]). 

In our framework, Toeplitz matrices are block-triangular so that they form a 
matrix algebra. This property is particularly effective for the design of efficient 
algorithms and we propose some numerical methods that exploit the block- 
triangular block-Toeplitz structure and the generator properties. Unlike the 
general methods, our algorithms allow one to deal with matrices T(U) of very 
large size. 

Two methods rely on spectral and computational properties of block-circulant 
and block e-circulant matrices (Bini |Bj, Bini et al. 0) and on the use of Fast 
Fourier Transforms (FFT). Recall that block e-circulant matrices have the form 

Un -1 1 


Ux 

(U n -1 Uo 

and that a block-circulant matrix is a block e-circulant matrix with e = 1. For 
simplicity, we denote by C(U ) the block 1-circulant matrix C\{U). 

Since block e-circulant matrices can be block-diagonalized by FFT [6], the 
computation of the exponential of an n x n block e-circulant matrix with mxm 
blocks can be reduced to the computation of n exponentials of mxm matrices. 
These latter exponentials are independent from each other and can be computed 
simultaneously with a multi-core architecture at the cost of a single exponential. 

The idea of the first method is to approximate e^U) by e c ‘U) where e £ C 
and |e| is sufficiently small. We analyse the error and are thereby able to choose 
the value of e which gives a good balance between the roundoff error and the 
approximation error. In fact, the approximation error grows as 0(e ) while the 
roundoff error is 0(/ue~ 1 ), where p. is the machine precision. This leads to an 
overall error which is 0(/z 1//2 ). By using the fact that the solution is real, by 
choosing e a pure imaginary number we get an approximation error 0(e 2 ) which 
leads to an overall error 0(/r 2 / 3 ). 

Since the approximation error is a power series in e, we devise a further 
technique which consists in averaging the solutions computed with k different 


Ce(U) = 


Uo Ux 
(U n _ x Uo 

eUx ... 
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values of e. This way, we are able to cancel out the components of the error of 
degree less than e 2k . This leads to a substantial improvement of the precision. 
Moreover, since the different computations are independent from each other, 
the computational cost in a multicore architecture is independent of k. 

In our second approach, the matrix T(U) is embedded into a K x K block- 
circulant matrix C(U^ K ^), where K is sufficiently large, and an approximation 
of e r ( u ' > is obtained from a suitable submatrix of e c ^ u< ' The computation of 
gC(u K ) j g rec j uce{ j to the computation of K exponentials of m x m matrices, 
and our error analysis allows one to choose the value of K so as to guarantee a 
given error bound in the computed approximation. 

The third numerical method consists in specializing the shifting and Taylor 
series method of m- The block-triangular Toeplitz structure is exploited in 
the FFT-based matrix multiplications involved in the algorithm, leading to a 
reduction of the computational cost. The algorithm obtained in this case does 
not seem well suited for an implementation in a multicore architecture. 

We compare the three numerical methods, from a theoretical as well as from 
a numerical point of view. From our analysis, we conclude that the method 
based on e-circulant matrices is the fastest and provides a reasonable approxi¬ 
mation to the solution. Moreover, by applying the averaging technique we can 
dramatically improve the accuracy. The method based on embedding and the 
one based on power series perform an accurate computation but are slightly 
more expensive. 

It must be emphasised that the use of FFT makes the algorithms norm-wise 
stable but that component-wise stability is not guaranteed. In consequence, the 
matrix elements with values of modulus below the machine precision may not 
be well approximated in terms of relative error. 

The paper is organised as follows. In Sections [2] and [3j we recall proper¬ 
ties of the exponential of a subgenerator and of its derivatives, and some basic 
properties of block-Toeplitz and block-circulant matrices which are used in our 
algorithms. In Section [4j we show how to compute the exponential of a block 
e-circulant matrix by using fast arithmetic based on FFT and we perform an 
error analysis. We present in Section [5] the algorithms to compute the expo¬ 
nential of T(U): first we analyse the decay of off-diagonal entries of the matrix 
exponential, next we describe the new methods and perform an error analysis. 
We conclude with numerical experiments in Section [6j 


2 The exponential of a subgenerator and its 
derivatives 

2.1 The exponential of a subgenerator 

A subgenerator of a Markov process is a matrix Q of real numbers such that the 
off-diagonal entries of Q are nonnegative, the diagonal entries are negative, and 
the sum of the entries on each row is nonpositive. We denote by 1 the column 
vector with all entries equal to 1, with size according to the context. If Q is a 


5 


subgenerator, then Q 1 < 0 and Q is called a generator if the row sum on all 
rows is zero. 

Let a = m.a,Xi(—qu). The matrix V = Q + a I is a nonnegative matrix, 
and we may write = e v ~ aI = e~ a e v . From the latter equality it follows 
that the matrix exponential is nonnegative. Moreover, since Q 1 < 0 it 
follows that VI = Q1 + al < al. Therefore, in view of p]), e^l = e~ a e x 1 = 
e~ a ESn Jf V l l < e~ a e a l. Thus we may conclude that e^l < 1, that is, e® is 
a substochastic matrix. 


2.2 Derivatives and perturbation results 

We recall the definition and some properties of the Gateaux and Frechet deriva¬ 
tives, and their expression for the matrix exponential function, together with 
some properties when the matrix is a subgenerator. We refer the reader to m 
for more details. 

The Frechet derivative of a matrix function / : C nxn —> C nxn at a point 
X € C nxn along the direction E £ C nxn is the linear mapping L(X, E) in the 
variable E such that 


f{X + E)~ f(X) - L(X, E) = o(\\E\\). (3) 


The Gateaux (or directional) derivative of / : <C nxn —► <C nxn at a point X £ 
C nXn along the direction E £ C" xn is 


G(X,E) = lim 

h-tO 


f(X + hE) - f(X) 
h 


(4) 


If the Frechet derivative exists, then it is equal to the Gateaux derivative am 
Section 3.2]). Such is the case for the matrix exponential function and we may, 
therefore, use either definition © or depending on which is more convenient; 
we will use the Gateaux derivative. From m, 


G(tX , E) 


e x ^~ s) Ee Xs ds 


(5) 


and the following equation gives an expression for the matrix exponential in 
terms of Gateaux derivatives: 


LI 

e t(x+ hE ) = y- 'J_G^(t.X,E) 

• ^ i I 

3 = 0 J ' 


( 6 ) 


where we denote by (tX, E) the j-th Gateaux derivative of the matrix func¬ 
tion e tx in the direction E, obtained by the recurrence equation 


G^ (tX, E) = j f eV-^EGV-V (sX, E)ds , j = 1,2,..., (7) 

Jo 

and G^{tX,E) = e tx . 
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Recall that if X is a subgenerator, then e tx is a substochastic matrix for 
any t > 0 and in particular \\e tx Hoc < 1. Therefore, by taking norms in ([5]), we 
obtain the upper bound 

||G(fA, -E^Hoo < f He^-^IU Halloo||e- Ys Hoods < tplU (8) 

Jo 

which may be extended to the j-th order Gateaux derivative as in the next 
proposition. 

Proposition 1 If X is a subgenerator, then 

IIG^Ab^lloo <tWoo (9) 

for t > 0, for any j > 0. Moreover, if E is a nonnegative matrix, then 
G^(tX, E) is nonnegative for any j > 0. 

Proof. Since the matrix X is a subgenerator, the matrix e rX is nonnegative 
and substochastic for any r > 0, therefore ||e TA || 00 < 1 for any r > 0. By using 
this property, the inequality 0 can be proved by induction. If j = 0, then 
||G[°l(tA',-E^Hoo < 1. The inductive step is immediately proved, since from 0 
we have 

||G^(GA, -E)||oo < j f lle^-^llooll^lloollG^-^CsA^^lloods 
Jo 

<3 [ t \\E\\i 0 si- 1 ds=\\E\\l 0 t J , 

Jo 

where the last inequality follows from the inductive assumption. If the matrix E 
is nonnegative, from the recurrence 0 and from the fact that e rX is nonnegative 
and substochastic for any r > 0, it follows by induction that G^ (tX, E) is 
nonnegative for any j > 0. □ 

The following result provides some bounds related to the exponential of 
the matrix T(U) of 0 and to its Gateaux derivative; it will be used in the 
next sections to analyse the stability of the algorithm in Section [572] based on 
e-circulant matrices. 

Theorem 2 Let T(U) be the matrix in Q and assume it is a subgenerator. 
For x = (Xi) i= i„ G C n_1 define H(x) = U 0 + YJiZi x iUi■ If \xf\ < 1 , i = 
1, then He^^Hoo < 1 for any s > 0. Moreover, ||G(T^(rc),£7)||oo < 
Halloo f or an D mxm matrix E. 

Proof. Define a = maxi(—(Uo)u), H = H(( 1,..., 1)) and B = H + al. From 
the choice of a it follows that B > 0. We have 

||e s "||oo = ||e sB - s “ 7 ||oo = e—“He-^IU < e- s “e" sB ll“ 

where the latter inequality holds in view of El Theorem 10.10]. Since B > 0 we 
may write that ||-B||oo = ||-Bl||oo- From the inequality (^fc=o ^4)1 < 0 we find 
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that (al + X]fe=o £4) 1 < al, that is B 1 < al whence HsRHoo < sa. Therefore 
we conclude that He^Hoo < 1 and the first claim is proved. 

Now, concerning H(x) we have 


^sH(x) gSH(x)-\-s<xI— sal ^-sct^sH(x)-\-saI 


( 10 ) 


Since |xfc| < 1, Uq + al > 0 and Uk > 0, k = 1,..., n — 1, we have 


n— 1 


n—1 


| H{x) + al | < \Uq + al\ + | ^ ' XkUk | < Uq + al + 'y ' Uk — B. 

fc=l fc=l 

By monotonicity of the infinity norm, we have \\H{x) + a/||oo < ||.B||oo, 

|| e si/(®)+sa/|| oc < e ||sff(a:)+SQ!7|| oc , < gUsBUoo & sa ^ 

and, from He^^Hoo = e -^ e sH(x)+ sa i ^ < L From 0, we have 

\\G(H( x )i -®)lloo < Halloo /' lle^-^^lloolle 8 ^ Hoods < I^IU 

Jo 

and the last claim follows. 


□ 


3 Fast computations with Toeplitz and circulant 
matrices 

In this section we recall some basic properties of block-Toeplitz and block- 
circulant matrices, useful for our computational analysis. We refer the reader to 
Bini and Pan [S] and Bini et al. [5] for more details. Given a matrix V € C mxn , 
we denote by V T and by V H the transpose matrix and the transpose conjugate 
matrix of V, respectively. The conjugate of a complex number z is denoted 
by z. 

Let i be the imaginary unit such that i 2 = — 1 and oj n = cos — + isin — 
be a primitive nth root of the unity. We denote by F = the 

Fourier matrix. Recall that F is nonsingular, that F -1 = 1F H and that, given 
a vector v £ C n , the application v — > u = Fv defines the inverse discrete 
Fourier transform (IDFT) of v. We assume that n is an integer power of 2, so 
that the vector u can be computed by means of the FFT algorithm in |nlog 2 n 
arithmetic operations (ops). The application u —> v = -F H u is called Discrete 
Fourier Transform (DFT) and the vector v can be computed in |nlog 2 n+n ops. 

Given the m x m matrices Vi, i = 0,..., n— 1, we denote by V = (V))i=o,n-i 
the block-(column) vector with block-entries Vi, i = 0,..., n — 1. Finally, we 
define T = F®I m , where (g> is the Kronecker product and I m the identity matrix 
of order m. This way, for a block-column vector V the matrix U = FV can be 
computed by means of mr IDFTs with |nw 2 log 2 n ops. Similarly, given the 
matrix U, the block-vector V = \F h U can be computed with | nm 2 log 2 n + 
nm 2 ops. 



3.1 Block-circulant matrices 


For the results in this section we refer the reader to the book |9j and to the 
references cited therein. 

Given the block-vector U = (f/j)i=o,n-i> with mxm blocks, the nxn block- 
circulant matrix C{U) = (Cij)ij=o,n-i associated with U is the matrix with 
block-entries 

U, .j — Uj—i mod n 

so that [Uq, ..., U n _ i] coincides with the first block-row of C(U) and the entries 
of any other block-row are obtained by the entries of the previous block-row by 
a cyclic permutation which moves the last block entry to the first position and 
shifts the remaining block-entries one place to the right. For instance, for n = 4 
one has 

' U 0 t/i U 2 U 3 

r(m _ U 3 U 0 lh U 2 

[ ’ U 2 U 3 U 0 lh ■ 

U\ U 2 U 3 Uq 

Observe that a block-circulant matrix is a particular block-Toeplitz matrix. 

Block-circulant matrices can be simultaneously block-diagonalized by means 
of FFT, that is, 


- X H C(U)X = diag(Vo,..., 14_r), V = TU. 
n 

This property shows that block-circulant matrices are closed under matrix mul¬ 
tiplication, i.e., they form a matrix algebra, moreover the product of a circulant 
matrix and a vector can be computed by means of Algorithm [l] This algorithm 
performs the computation with 2 m 2 FFTs and n matrix multiplications. Since 
2m 3 — to 2 ops are sufficient to multiply two mxm matrices, the overall cost of 
Algorithm [T] is 3 nm 2 log 2 n + nm 2 + (2m 3 — m 2 )n ops. 


Algorithm 1: Product of a block-circulant matrix and a block-vector 
Input : Two block-vectors A' = (A' i ) i=0jn _i, U = (Gi)j =0jra _i 
Output: The block-vector Y =C(U)X , Y = (Yi)i= o,n-i 

1 V = TU 

2 Z = TX 

3 Wi = ViZi, i = 0 ,. .., n - 1 , W = {Wi)i=o,n -1 

4 Y = \T H W 


If the input block-vectors are real then the vectors V = XU and Z = XX 
have a special structure, that is, the components Vo, Z§ and V n / 2l Z n i 2 are real 
while Vi = V n _i, Zi = Z n _ il for i = 1,..., n/2 — 1. In this case, the number of 
matrix multiplications at step 3 of Algorithm [l] is reduced to n/2. 
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Remark 3 Observe that the product of two circulant matrices may be com¬ 
puted by means of a product of a circulant matrix and a vector by means of 
Algorithm[I] In fact, since the last column of the block-circulant matrix C(U) 
is the block-vector U = if C(Y) = C(U)C(X) then we find that 

1 — CiU^X ^ where X — (A n _*_ i)*=o,n—1, X — (In—z—1)^=0,n-i- 


3.2 Block-triangular Toeplitz matrices 


We denote by U the block-vector (Uj)j=o,n-i and by T(U) the block-upper 
triangular block-Toeplitz matrix whose first row is [Uo, ..., U n -i\. For n = 4, 
for instance, 


T(U) = 


U 0 U x U 2 U 3 

0 Uo Ui u 2 

0 0 Uo u t 

0 0 0 Uo 


Block-upper triangular block-Toeplitz matrices are closed under matrix multi¬ 
plication. 

Consider the vector U of 2n components obtained by filling the vector U 
with zero blocks, and the block-vector V = (I 7 )i=o,n-i such that V 0 = 0 and 
Vi = U n -i for i = 1,... ,n — 1. Then the matrix C(U) can be partitioned as 


follows 


C(U) 


T(U) C(V) 
C(V) T{U) 


( 11 ) 


where C(V) is the block-lower triangular block-Toeplitz matrix whose first block- 
column is V. This expression enables one to compute the product Y = T{U)X 
of a block-upper triangular Toeplitz matrix and a block-vector with a low num¬ 
ber of arithmetic operations. In fact, from ( |TT] ) one deduces that Y coincides 
with the first half of the block-vector Y = C(U)X where X is the block-vector 
of length 2 n obtained by filling X with zeros. This fact leads to Algorithm [2] for 
computing the product of a block-triangular block-Toeplitz matrix and a block- 
vector. The cost of this algorithm is 6?rm * 1 2 log 2 (2n) + 2 nm 2 + 2(2 m 3 4 — m 2 )n 
ops. 


Algorithm 2: Product of a block-triangular block-Toeplitz matrix and a 
block-vector _ 

Input : Two block-vectors X = (A'i)j = o, n _i, U = (f7j)j = o, n -i 
Output: The block-vector Y = T(U)X, Y = (Yi)i = o, n -i 

1 Set X = (Xi)i= o, 2 n-i with X t = Xi for i = 0,..., n — 1, X t = 0 for 
i = n,... ,2n — 1 

2 Set U = ( Ui ) with 17* = Ui for i = 0,..., n — 1, Ui = 0 for i = n,..., 2n — 1 

3 Apply Algorithm [I] to compute Y = C(U)X 

4 Set Y = (Yi)i— o, n —i with Yi =Y iy for * = 0,..., n — 1. 
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Remark 4 Observe that the product of two block-triangular block-Toeplitz 
matrices can be computed by means of a product of a block-triangular block- 
Toeplitz matrix and a block-vector by means of Algorithm [2] In fact, since 
the last column of T(U) is the block vector U = i)i=o,n-i, if T(Y) = 

T{U)T(X) then we find that Y = T{U)X , where X = (A' rl -i-i)i=o,n-i, Y = 

(Yn—i— 1 )i=0,n— 1 • 


3.3 Block-e-circulant matrices 

Given a block-vector U and a complex number e, the block-e-circulant matrix 
C € (U) = ( Ci t j ) is defined by 

for j > i, 
for j < i. 


U 2 U 3 
Ui u 2 
U 0 U t ' 
eU 3 Uq 

Observe that a block-e-circulant matrix is a particular case of block-Toeplitz 
matrix and that, for |e| small, a block e-circulant matrix is an approximation of 
a block-triangular block-Toeplitz matrix. 

Like block-circulant matrices, block-e-circulant matrices can be simultane¬ 
ously block-diagonalized by means of FFT, so that they are closed under matrix 
multiplication and form a matrix algebra as well. In fact, one can show that 

-J 7 'D~ 1 C e (U)'D <i J 7H = diag(Vb, • • ■, V n -\), V = T H V t U, (12) 
n 

where 

V e = D e ® I rn , D e = diag(l, 9, d 2 ,..., r" 1 ), 9 = e l ' n . 


Ci,j = 


Uj-i 


For instance, for n = 4 one has 


Ce(U) = 


U 0 

Ui 

eU 3 

U 0 

eU 2 

eU 3 

ef/i 

eU 2 


4 The exponential of a block-e-circulant matrix 


Let U = (I/,;) i= o >n _i be a block-vector of length n where Ui £ C mxm 
the block-e-circulant matrix C e (U) and its matrix exponential e c ^ u ' > 


consider 
In view of 


(12), we find that 


e c e {U) = -V e X H dia.g{e Vo ,..., e^”- 1 ) TV~ X , V = T H V f U. 


Therefore the exponential of a block-e-circulant matrix is still block-e-circulant. 
Moreover, we have e c ‘^ = C e (Y) where 


Y=-V~ X TW, W = (W i ) i=0 ,n-i, Wi=e Vi , i = 0,...,n-l, (13) 

n 
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and V = T H VfU. The above equations allow to compute the exponential of an 
n x n block-e-circulant matrix by computing n exponentials of m x m matrices 
and two Fourier transforms, as described in Algorithm [3] 


Algorithm 3: Exponential of a block-e-circulant matrix 

Input : A complex number e, the block-vector U = (JJi)i= o,«-i 

defining the first block-row of the e-circulant matrix C e (U) 
Output : The block-vector Y = (li)j = o irj -i such that 

C e (Y) = e c 'W 

1 Z = VJJ 

2 v = t h z 

3 Wi = e Vi , i = 0,..., n — 1, and set W = (Wj),=o,n-i 

4 R= \TW 

5 Y = V~ X R 


Observe that the multiplication of U by the diagonal matrix T> e at step 1 
reduces to scaling the blocks Ui by the scalar 9i. The multiplication by Vj 1 at 
step 5 performs similarly. Therefore the overall cost of the algorithm is given 
by 3 m 2 n log 2 n + 3 m 2 n ops plus the cost of computing n exponentials ofmxm 
matrices. 

For e = 1 the block-e-circulant matrix turns to a block-circulant matrix and 
Algorithm [3] takes the simpler form described in Algorithm |4j The computa¬ 
tional cost in this case is reduced to 3 m 2 n log 2 n + m 2 n ops plus the cost of 
computing n exponentials ofmxm matrices. 


Algorithm 4: Exponential of a block-circulant matrix 
Input : The block-vector U = (Ui)»=o,re-i defining the first 

block-row of C{U) 

Output : The block-vector Y = (Tj)i = o, n -i such that C(Y) = e c( - u ^ 

1 v = R h U 

2 Wi = e Vi , i = 0,..., n — 1, and set W = (Wj),- =0in _i 
a Y= 

n 


4.1 Numerical stability 

Let U = (/ 7 i)i=o,n-i be the block-vector defining the first block row of the sub¬ 
generator T(U). We analyze the error generated by computing the exponential 
of the block-e-circulant matrix C e (U ) by means of Algorithm [3] in floating point 
arithmetic, where e£C with |e| < 1. 

Here and hereafter fl(-) denotes the result computed in floating point arith¬ 
metic of the expression between parenthesis. The symbol = denotes equality up 
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to lower order terms, and similarly the symbol < stands for inequality up to 
lower order terms. The symbol y denotes the machine precision. 

We recall the following useful fact (see [HI page 71]) 

&(xy) = xy(l +rj), \rj\ < V ='■ Ph, /3 = 2 V2, (14) 

for x, y G C, and we use the following properties involving norms, where v G C n 

IMloo < Nk IMI2 < VnMoo, ||u|| 2 < |M|i, 

Halloo < n||u||oo, ll^lb < Vn\\v\\ 2 - 

In order to perform the error analysis of Algorithm [3] we recall the following 
result concerning FFT (see m page 453]). 


Theorem 5 Let x be a vector of n components, n = 2 q , q integer, y = Fx, 
where F = (w(f)ij=o,n-i the Fourier matrix. Let y be the vector obtained 
in place of y by applying the Cooley- Tukey FFT algorithm in floating point 
arithmetic with precision y where the roots of the unity are approximated by 
floating point numbers up to the error v. Then 


I \y - vh < wt 
hh ~ 1 -qy 1 


p = v + (■\pi + v) --—. 

1 — Ay 


In particular, with v = y and performing a first-order error analysis where 
we consider only the part of the error which is linear in y we have 


\\y-yh 

Wvh 


< 


7 = Ay/2 + 1. 


(16) 


Observe that, since F H = F, we may replace F with F H in the statement of 
Theorem [U 

We split Algorithm [3] into three parts. The first part consists in computing 
the entries of the matrices 14 by means of steps 1 and 2, the second part 
consists in computing the entries of II 4 = e Vk and the third part is formed 
by the remaining steps 4 and 5. The first and third part can be viewed as 
the collection of m 2 independent computations applied to the entry (r, s) of 
the generic block for r,s = 1 More specifically, given the pair (r, s), 

denote u = (itfc), z = (z/), v = (vk) G C n the vectors whose components are 
( Uk) r ,s j {Zk)r,si (I4) r , s , k = 0,... ,n — 1, respectively. The computation of u 
is obtained in the following way: 6 = e 1//n , Zk = 0 k Uk , for k = 0,... ,n — 1, 
v = F H z. While, denoting w,r,y G C" the vectors whose components are 
(kFfc)r,s) (Rk)r,s, (^4)r,s) k = 0,..., n — 1, respectively, the computation of y is 
obtained in the following way: r = -Fw , yk = 9~ k rk for k = 0,..., n — 1. 

Define S z = z — z, 5 V = v — v, S r = f — r, S y = y — y where z, v, f, y are 
the values obtained in place of z, v, r, y by performing computations in floating 
point arithmetic. We denote also by (S r )k = hk — Dc and (6 y )k = yk — yk the 
fc-th component of S r and S y , respectively. 
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In our analysis we assume that the constants 0 k have been precomputed 
and approximated with the numbers Ok such that Ok = 0 k { 1 + 07 ), \<Jk\ < //, 
k = —n + 1 ,... , 0, ... ,n — 1 . 

Since Zk = 0 k Uk , from (14) we find that &(0 k Uk) = 0 k Uk{ 1 + rjfc)(l + cr fc ) = 
d fc it fc (l + r/k + cr fc ). Thus, 


Halloo < KINloo, C = /3 + f- 


(17) 


Denoting by S' the error introduced in computing the FFT v of z in floating 
point arithmetic, we have 

Sv = F H 5 Z + 6 , 


and in view of (TT5|) , (16), and (TT7T) we obtain 


IIMoo ^ ^ll^lloo + ||<5'|| 2 < n||5 z ||oo +At7log 2 n||u|| 2 

< nCn ||z||oo + M7 ri l°g 2 ri lklloo 

= /HMI°°(C + 7log 2 n) 

< /in||u|| 00 (C + 7l°g 2 ri), 


where the last inequality follows from the fact that H^Hoo < HuHoo since Zk = 
0 k Uk and \0\ < 1 . This implies that Ay k = 14 — 14 is such that 

max |(Ay fe ) r J < /m( ( + 7 log 2 n) max \(U k ) r ,«|, 

k k 


which yields 

||AyJ|oo < mmax|(A Vfc ) riS | < /xmn(C + 7 log 2 n) max |({7/ l ) f . )S |. (18) 

k r,s,h 

Concerning the second part of the computation, for the matrix A^ = 1-14 — 
Wfc we have 

Avy fc = &(e Vh ) - e Vk , fl (e Vk ) = + E k (19) 

where Ek is the error generated by computing the matrix exponential in floating 
point arithmetic. Here we assume that ||£4||oo < A tr l|l^fc|| 0 o for some positive 
constant r which depends on the algorithm used to compute the matrix expo¬ 
nential. From the properties of the Gateaux derivative one has He^ — e Vk || = 
||G(14,AyJII, and from Theorem [5J applied with Xi = u)‘ r k 0 l . i = l,...,n— 1 , 
it follows that ||G(I4, AyJHoo < IJAyJoo and ||Wfc||oo < 1- 

Combining these results with © leads to the bound 

l|AwJ|oo < HAyJloo +/J.T. (20) 

Finally, for the third part of the computation, consisting of steps 4 and 5, 
we have 

S r = —F5 W H— 5" 
n n 
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where 5" is the error obtained by computing Fw in floating point arithmetic. 
Thus from (161 we have 


IK 


< 


-\\™« 
n 

fi: 11 8w || oo ” 


loo +/Z 7 log 2 n||r|| 2 

Wy/n\og 2 n\\y\\ 

OO J 


( 21 ) 


where the second inequality holds from © and from r k = 0 k y k since |0| < 1. 
Moreover, we find that 

{8y) k =6~ k (8 r ) k +0~ k r k v k = 6~ k ((S r ) k + y k v k ), \v k \ < (p. (22) 

Now we are ready to combine all the pieces and obtain the error bound on 
the computed value Y. From (22) we get 

\{ 5 y ) k \ < |0r fc (||<Mloo + CmIMIoo)- 

On the other hand, by using (pll), we find that 


\(8y)k\ < \e\ 


-k( 


(C + 7\/^log 2 ^)/u||y||oo)- 


Thus we have 

HAyJloo < m\(S y )k\ < m|6>|“ fc (max||An/ h ||oo + (C + 7v / ™k>g 2 n)p max Halloo)- 


Moreover, from ( |20[ ) and (181 we conclude with the following bound 

|| Ay Joe < m\6\~ k (ma,x\\A Vh \\ 00 + pr + ((+ 'y^h\og 2 n)pma,x\\Y h \\ 00 ). 

h h 


Whence 

HAyJloo < y,m\0\- k (rnn{(+'ylog 2 n)m&x\(U h )r,s\+T+((+"fVn\og 2 n)max\\Y h \\ oo ) 

r,s,h h 

and we may conclude with the following 

Theorem 6 LetY k be the value ofY k provided by Algorithm^applied in floating 
point arithmetic with precision p for computing C e (Y) = e Cc ^ u \ where Y = 
(Y k ) k =o,n-i, U = {U k ) k = 0 ,n-i- Denote A Yk = Y k - Y k . One has 

HAyJloo < pe~ 1 nup 


where 

(p = mn{ C + qlog 2 n) max \(Uh) r ,s\ + (C + 7v / ™log 2 n) max^H,*, + r, 

r,s,h h 

C = 1 + 2v / 2 , 7 = 4a/2 + 1 , and rp is the error bound in the computation of the 
matrix exponential, i.e., such that \\fl(e v ) — e v ||oo < /irUe^Hoo for an m x m 
matrix V. 
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In the case where e = 1, we apply Algorithm [4] to compute the exponential 
of a block-circulant matrix and the above result leads to 

Theorem 7 Let Yj. be the value of Yk provided by Algorithm [^] applied in 
floating point arithmetic with precision /r for computing C(Y ) = e c '- u \ where 
L — o^n—l? U — {Uh)k=Q,n — l• Denote Ay*. — Yk- One has 

HAyJloo < M™X 

where \ = mny log 2 nmix^,/, |(t4)r, s | + lVn\og 2 nmax t Halloo + t, and ( = 
1 + 2 V 2 , 7 = 4-\/2 + 1, and r/i is i/ie error bound in the computation of the 
matrix exponential, i.e., such that \\fl(e v ) — e^Hoo < M T l|e y ||oo for an m x m 
matrix V. 


5 The exponential of a block-triangular block- 
Toeplitz matrix 


Let U = (Ui)i = o, n -i be the block-vector defining the first block-row of the 
subgenerator T{U) of ([l]). Since block-triangular block-Toeplitz matrices form 
a matrix algebra, by using the Taylor series expansion of the matrix exponential, 
it follows that e T ^ u ' 1 is still a block-triangular block-Toeplitz matrix. Denote by 
A = (Ai)i = o tTl _i the block-vector defining the entries on the first block-row of 
e r ^ u \ i.e., such that T(A) = e r( ' U \ In particular, we have Ao = e u °. 

Let K > n and define the K dimensional block-vector U^ obtained by 
completing U with zeros: 


= (Un^K-r, U™ = 


Ui for i = 0,..., n — 1, 
0 for i = n,..., K — 1. 


(23) 


Consider the K x K block-triangular block-Toeplitz matrix T(E/^). In view 
of [HI Theorem 3.6], if K 2 > K\ > n, then e '7~( i7(ifl) ) i s the principal K\ x K\ 
block-submatrix of e r( ' U< ’ Denote by A^ = (Ai)i = o : K-i the block-vector 
defining the first block-row of e 7 W (A) ), i.e., A^l is the block-vector such that 
T(AW) = e r (- um \ 

Let U = {Uf)i—Q n —i be such that 


Uo = U 0 + aI, Ui = Ui, i=l, . 1, (24) 

where a = maxj(—(Uo)jj). Define the block-vector U( K l with block-components 
U- K) = Ui for i = 0, ..., n - 1, and (j\ K) = 0 for i = n, . .., K - 1. Ob¬ 
serve that T(EA K )) = T(E/ (A) ) + olI is a nonnegative matrix, and we may 
write e r ( u(K) ) = e~ a e r A J '' K) ) _ We denote by A^ = {Ai)i=o t K-i the block- 
vector such that T{Af K ">) = e r ^ ( \ In particular we have Ai = e~ a Ai, 
i = 0,...,K -1. 
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5.1 Decay properties 

In this section we investigate decay properties of the exponential e ’ r(T7 “ 1 of a 
subgenerator, in the case where the subgenerator is a banded block-triangular 
block-Toeplitz matrix. These properties will be used in Section [5.3| to estimate 
the approximation error of the numerical method based on the embedding into 
a block-circulant matrix. 

Decay properties of matrix functions have been analyzed in the literature. 
We refer the reader to the survey paper [5] and to [4] . In our case the structure 
and sign properties play an important role. The matrix exponential eT^*' is 
not banded in general, but its off-diagonal entries have useful decay properties 
for K —> oo. To prove this fact we need the following result [S] Theorem 3.6] on 
decay properties of analytic functions. 


Theorem 8 Let H(z) = z l Hi be an m x m matrix power series analytic 

for z € C with |z| < R, and R > 1. For any 1 < cr < R, the block-coefficients 
satisfy 

m < M(a)a~\ i = 0,1,... (25) 


where M(a) is the m x m matrix with elements max|.| =0 . \h rs (z)\, for r,s = 
1,..., m, and the inequality (25) is meant componentwise. 


The following result provides bounds to A t , i = 0,..., K — 1. 


Theorem 9 Let K > n and let T{A^ K 1) = with = ( Ai)i=o,K-i, 

where T(U) in ([l]) is a subgenerator and is defined in (23). For any a > 1, 
we have 

Ail < e “C <r ” _1 - 1 ) (7 - i i ) i = 0,...,K-l, 
where a = maxj(— (Uo)j,j). 


Proof. We associate with the block-vector U = ( Ui)i=o t n-i of (24) the mxm 
matrix polynomial U(z) = Y^h =o zh Uh- For the properties of block triangular 
block-Toeplitz matrices [ 8 ], the matrix is still a block-triangular block- 

Toeplitz matrix and the blocks in its first row are the coefficients of the matrix 
polynomial P^\z) = U(zy mod z K . Let be the matrix coefficient of 
degree i of P^\z), for i = 0,..., K — 1. From the power series expression of 
the matrix exponential we find that 



i = 0,..., K - 1. 


(26) 


We want to give an upper bound to the matrices P- J> . Since U(z is a matrix 
polynomial, then it is analytic in all the complex plane and we may apply 
Theorem [ 8 ] with Ft(z) = U{zY and any cr > 1. We have to estimate the matrix 
M(cr). The matrix coefficients of U(z) are nonnegative, therefore for any cr > 1 
and for any z £ C with |z| = a , we have |C/(z) J '| < U(aY < (f7(l)cr ra " 1 ) J . Since 
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T(U) is a subgenerator then {7(1)1 = (al + J2h= o^)l < oil. So that we 
obtain |{7(z)- 7 |l < Since P^\z) = U(zY mod z K , then 


is 


the matrix coefficient of degree i of U(z)° and, in view of (25), we find that 


Pf 1 < a?(j^ n *1. From this inequality and from (26) we obtain that for 
any a > 1 and for * = 0,..., K — 1 


A 1 < -a j a 
j =o 


in-l)j a -i 1 = a ~i e acr 


1 . 


Since A, = e “A; we conclude the proof. 


□ 


5.2 Method based on e-circulant matrix 

Let e £ C with |e| sufficiently small, and consider the block-e-circulant matrix 


cyu) = 


u 0 t/i 

ef7„_i Uq 

eU, ... 


eU n - 


n— 1 


U n -1 


U\ 

U 0 


(27) 


The exponential of C e (U) is still a block-e-circulant matrix, that can be 
computed by means of Algorithm [3j Denote by Y = (Y))j = o, n -i the block- 
vector such that C e (Y) = e c ^ u \ The idea is to approximate the blocks A;, 
defining T(A) = e r ^ u \ by the matrices Yi , for i = 0,. .., n — 1. 

In order to estimate the approximation error, observe that the matrix C e (U) 
can be written as C e (U) = T(U) + eL, where 


0 0 ... 0 

u n -1 o ■■. : 

: 0 

J7i U n -1 0 


(28) 


This property allows to give the following estimate: 


Theorem 10 Assume that 7~{U) is a subgenerator, and that e £ C. One has 
\\ e T(U) _ < e |e|||£|U _ L 


Moreover, if e is a pure imaginary number, then 


||e r ^-Re(e c '^)|| 00 <el £ l 2 ll i ll» 


- 1 , 


where Re(e C€ ^^) is the real part of e Ce ^ u \ 
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Proof. According to §, 


e CA u ) = e T{u) + J2 e -G\ j ]{T(U),L), 


j =i 


(29) 


where (T(U). L) are defined by means of ([?]). From Proposition [l] we obtain 

IIe r(c/ ) - e^lU < Yi ^||GM(T(C/),L)||oo < £ ^ll^llio = e |e|l|i|U - 1. 


3=1 


3=1 


If e is a pure imaginary number, since is a real matrix, the inequality is 

obtained by comparing the real parts in (29) and by applying Proposition [I] □ 

It is interesting to observe that the choice of an imaginary value for e provides 
an approximation error of the order 0(|e| 2 ) instead of 0(|e|). The idea of using 
an imaginary value for e was used in [T] in the framework of Frechet derivative 
approximation of matrix functions. 

The error bound can be improved by performing the computation with sev¬ 
eral different values of e and taking the mean of the real parts of the results 
obtained this way. For instance, choose ei = (1 + i)y/2e, ti = —ei, where e > 0, 
and recall that e C ‘^ U \ j = 1,2 are power series in e. Taking the arithmetic 
mean of e C 'i ( u ' > and e C ‘^ u \ the components of odd degree in e cancel out while 
the coefficient of e 2 is pure imaginary. Therefore taking the real part of the 
arithmetic mean provides an error 0 (e 4 ). 

This technique can be generalized as follows. Choose an integer k > 2 and 
set ej = (i) 1 //fc w^,e, j = 0 ,..., k — 1 , where (i) 1/,fe is a principal fc-th root of i. 
Then one can verify that the arithmetic mean of j = 0,..., k — 1 is a 

power series in e fc , moreover, is a pure imaginary number so that the real 
part of this mean provides an approximation with error 0 (e 2fc ). 

Observe that computing the exponential for different values of e might seem 
a substantial computational overload. However, in a parallel model of computa¬ 
tion, the exponentials e Cc i ^, j = 0 ,..., k — 1 , can be computed simultaneously 
by different processors at the same cost of computing a single exponential. 

Algorithm [5 reports this averaging technique. 

Theorem M provides us with a bound on the error generated by approxi¬ 
mating the exponential of a block-upper triangular Toeplitz matrix by means of 
the exponential of a block-e-circulant matrix. In fact, in practical computations 
in floating point arithmetic, the overall error is formed by two components: one 
component is given by the approximation error analyzed in Theorem |10[ the 
second component is due to the roundoff and is estimated by Theorem [ 6 ] More 
precisely, the effectively computed approximation in floating point arithmetic 
is the block-vector with components Yfc = + A y k , k = 0 , ..., n — 1 , where 
|| Ay fc | joo is bounded in Theorem [ 6 j On the other hand, Yfc = Aj. + E' k where, 
by Theorem 10 E' k is such that 


Po 


m 2 \\m 2 oo) 

mmu 


if e is imaginary 
otherwise 
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Algorithm 5: Exponential of a block-triangular matrix by means of e- 
circulant matrices and averaging 

Input : The block-vector U = (C/*)*= o.n-i defining the first 

block-row of T{U)\ a real number e > 0 ; an integer k > 0 . 
Output : The block-vector Y = (Yi)i = o, n -i that is an 

approximation of the first block-row of eT^. 

1 Set ej = (i) 1/k Lu{e, j = 0,..., k - 1 

2 Compute the first block row W ^ of e C ‘-> ^ U \ j = 0,..., k — 1, by means of 
Algorithm [3] 

3 Set Y = \ ES Re(W«)) 


where ip(t) = e t — 1. This way, for the overall error Ek = A y k + E' k one has 

Halloo < IIAvJloo + Halloo < + 

for t = |e|||i||oo, or t = |e| 2 ||i||^. 

This shows the need to find a proper balance between the two errors: small 
values for |e| provide a small approximation error H-E^Hoo but the roundoff errors 
diverge to infinity as e —>• 0. A good compromise is to choose e so that the upper 
bounds to ||-E^|| and || Ay k ||oo have the same order of magnitude. Equating these 
upper bounds in the case of non-imaginary e yields 

|e| = ^"w/ll-^lloo, Halloo < 2 e||i||oo 

and in the case of imaginary e, 

l e l = Halloo < 2 e 2 ||L||^ 0 . 

The latter bound is an 0(/x 2 / 3 ). This implies that asymptotically, as /i —>■ 0, we 
may loose 1/3 of the digits provided by the floating point arithmetic. 

If we adopt the strategy of performing the computation with k different 
values of Cj = (i ) 1 / k LU J k e, j = 0 ,..., k — 1 , so that the approximation error is 
0 (e 2fc ), then the total error turns to 0 (/z 2fe / 2fe+1 ), i.e., only l/( 2 fc + l) digits are 
lost. 

An interesting point is that the quantities ||T||oo and max riS ^ |(f/? l ) r . jS | are 
involved in the expressions of the error bound. Since T(U) is a generator, both 
these quantities are bounded from above by a = However, by 

means of simple manipulations, we may scale the input so that it is bounded by 
1 . This is performed by applying to T{U) the scaling and squaring technique 
of [121 . 

Let p > 0 be an integer such that a < 2 P . Then, since = (e 7 d c/ / 2P )) 2P , 
we first compute e 7_( - t/ / 2P ) anc [ then recover e k{u '> by performing p repeated 
matrix squaring. In this way we have ||L/2 P || 00 < 1 and max r s ^ |(/7ft) r>s /2 p | < 
1. Since T{U/2 P ) is still a generator, the error analysis performed for e 
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applies as well, and we can approximate the first block-row of eTK> l 2P ) with 
the first block-row Y = (Y)) of e 7 ^ C7e / 2P ) for a suitable e £ C with |e| < 1. 
Finally we recover an approximation to e T ^ u ' ) by computing T(Y) 2P by means 
of p repeated squarings, by using the Toeplitz structure and Algorithm [2j in 
view of Remark [4] The overall procedure is described in Algorithm [6] 


Algorithm 6: Exponential of a block-triangular block-Toeplitz matrix by 
using e-circulant matrices 

Input : The block-vector U = (£/*)*= o,n-i defining the first 

block-row of T(U), e £ C 

Output : The block-vector Y = (Yi)j = o,n-i> that is an 

approximation of the first block-row of e r ^ u ' ) 

1 a = maxj(—(Uo)jj), p = Lk>g 2 aj + 1 and U = U/2 P 

2 Compute Y, the first block-row of e Ct ^\ by means of Algorithm [ 3 ] 

3 If e is imaginary, replace Y with the real part of Y 

4 for r=l,...,pdo 

5 | compute T(Y) = T(Y)T{Y) 

6 end for 


5.3 Embedding into a circulant matrix 

The idea of this method is to embed the matrix T(U) into a K x K block- 
circulant matrix C(U^). The first block-row of e r< ' U ' > is approximated by 
the first n blocks of the first block-row of e c ( u(K) )_ Specifically, take K > n 
and consider the block-vector U'- K ' 1 defined in (231. The block-circulant matrix 
C(U ( - K ' > ) may be partitioned as 


C{U (K) ) = 


T(U) P 
Q T(U^ K -^) 


where P and Q are n x (K — n) and (K — n) X n block-matrices, respectively. 

Denote by £\ and by 8k the ( mnK ) x (mn) matrices formed by the first 
mn and the last mn columns, respectively, of the identity matrix of size mnK. 
The matrix C{U( K ^) can be also written as 


C(UW) = T(P (k) ) + H k , H k = E k LS f, (30) 


where the matrix L is defined in ( [28] ) . Because of the triangular Toeplitz struc¬ 
ture, the desired matrix e r< ' U ' ) is identical to the n x n block-leading submatrix 
of e 7 ^ < - u ' > . Our idea is to approximate the first block-row of e riyU ' 1 with the first 
n blocks of the first row of e c ( u{K) ). As pointed out in Section S e C(U^) is 
a block-circulant matrix, and can be computed by means of Algorithm [4] with 
3m 2 K log 2 K + m 2 K ops, plus the cost of computing K exponentials of m x m 
matrices. 
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Denote by S ^ = (S'| A ^)j = o,jf-i the first block-row of e c ( !7<A) ), so that 
C(S I ' K -*) = e c ^ c/< / An approximation of the matrices A,, i = 0, 1, 

defining the first block-row of e r( ' U ' > is provided by S^ K \ i = 0,..., n — 1; as K 
increases, the approximation improves, as shown by the following result. 


Theorem 11 Let e= T(A), with A = (Aj)j = ,o,n-i- Let K > n and let 
e C((7 (if) ) _ C{S^ K )), with SW = One /ias — A; > 0 /or 

i = 0,..., n — 1, and 




— An 


/or any a > 1, where 


o(K) 

°n—1 


Ay, —1 


< fK{cr) 


(31) 


/if (c) = (e" L "” - l^" -1 - 1 ) 


T —if+n 


1 — a 


with a = maxj(—(Uo)jj) and L defined in (28). 

Proof. By using (301 and Q , we find that 

oo 1 

e c( u ^) e nu <*>) = ^ i_ G bi (t(u^ k )),h k ). 

i=i J ' 


Equating the first n blocks in the first block-row in the above equation yields 


S { 0 K) - A 0 


q(K) a 
*n-l “ A n-1 


= £> w . 


^' 7! 
J=1 J 


(32) 


where is the block-row vector formed by the first n block-entries in the 
first block-row of G^(Tk{U), H K ). That is, 

W® =£[GW(T(U {K )),H K )£ i, 


where £\ is the mnK x m matrix formed by the first m columns of the identity 
matrix. Since Hk > 0, from (32) and from Proposition [I] we deduce that 
> 0 so that S[ K ^ — Ai > 0 for i = 0,..., n — 1. On the other hand, in view 
of ([7]) and from the fact that Hk = £k L£fi . we may write 


W® =j[ £) r e (1 - s)r(£/<A)) £' i fL£' 1 T G b - 1 ](sT(/7 (K) ),lP i f)£'ids 

Jo 


= J 


V(s)LZ^~ 1 \s)ds 


(33) 


where V(s) = [ Vo(s) ... V„_i(s) ] is the block-row vector formed by the 

last n block-entries of the first block-row of e^ 1_s ^ 7 ”^ !7< and Z^~ 1 \s) is the 
n x n block leading submatrix of G^~ l \sT{U^ K ^), Hk)■ 
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Since the matrix (1 — s)T(U^ K ^) is a subgenerator, it follows that Vi(s) > 0 
and, from Theorem [9j that for any a > 1, 

Vi(s) 1 < e ^-^^ n - 1 -^) a -K+n-i 1 < e °-^ n - 1 - 1 ) a -K+n-i 1 , 

for * = 0, ...,n — 1, where the latter inequality follows from the fact that 
1 — s < 1. This implies that 


n —1 


||V(s)||oo < e Q(CT " ^ a ~ K + n ~i < e a ^ n ~^ 


r -K+n 


i=0 


1 — a 


-l' 


Moreover, since sT(U is a subgenerator, Hk is nonnegative and ||iJif||oo = 
||L||oo, then, from Proposition |lj we have G^~ 1 \sT(U ( ^ k ' ) ), Hk) > 0 and 

HGb- 1 ]( s r(c/( if) ),^)||oo < s ^ wlw ^- 1 . 

This latter inequality implies that (s)||oo < s J ' _1 ||.L||£^ 1 . Therefore, by 

taking norms in (331, we find that 


Halloo < j f 1 ||y(s)|| 00 ||L||oo||Zy- 11 (s)||oods 
Jo 


<m\ 


j p«( ctTI 1 -1) a 


—K-\-n /*! 


1 — a 


i ri- 

T / f- 1 ** = \\L\\ 

Jo 


-i rr—^+n 

= \\T^\i o e a ^ r, . 


Hence, by taking norms in (32), we obtain (31). 


□ 


Remark 12 The matrices Ai and have a probabilistic interpretation. Namely, 
the matrix Ai is the probability that the BMAP is absorbed after time 1, and at 
time 1 there have been i < n arrivals; the matrix s[ K ^ is the probability that the 
BMAP is absorbed after time 1, and at time 1 there have been i, or i + K, or 
i + 2K, or ..., arrivals. Clearly, there are more trajectories favourable for 
than for Ai and Ai < S\ K ’. Similarly, there are more trajectories favourable 
for Si than for for a positive integer t. This shows that, if we take 

a sequence of integers £i, £ 2 , • ■ ■, and a sequence Kq, Ki, K 2 , ■ ■ ■, such that 
K-n+i = £n+ iN n , then 


s (Ko) > giKt) > s (K 2 ) > ... > A . 

for i = 0,..., A'o — 1. Therefore, the sequence {S 1 ^} has some monotonicity 
property in its convergence to A. 


The bound in (31) shows that the error has an exponential decay as K 
increases. Moreover, such bound holds for any a > 1. Therefore we can fix a 
tolerance e and a cr > 1, and find K such that /jc(o') < e. Since we would like to 
keep K as low as possible, another way to proceed is to fix a tolerance e and find 
er such that the size K for which /ft-(ex) < e is minimum. More specifically, after 
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some manipulations, from the condition /^(cr) < e we obtain that K > g(a) 
where 

a(cr”" * 1 2 3 4 5 6 7 - 1)+ log(cr/(<r - 1))+log(e _1 )+ log(e l|L|l “ - 1) 

SW =-,—rb-+ n - 

k)g(cr) 

Since a > 1 is arbitrary, we choose a such that g(cr) has a minimum value. In 
fact, the function g(a) diverges to infinity as a tends to 1 and to oo, therefore 
it has at least a local minimum a* and we can choose K > g(cr*). 

When we perform the computation in floating point arithmetic, we have to 
consider also the error generated by roundoff in computing the exponential of 
a block-circulant matrix. In practical computations, we obtain a block-vector 
with components Yi = Yi + Ay;, k = 0 ,..., n — 1, where || Ay { ||oo is bounded in 
Theorem [7] and Yi = A t + E[ where, by Theorem 0 E[ is such that 

\\[E' 0 ,...,E l n _ 1 ]\\ oo <f K (a). 

Altogether, for the overall error Ei = Ay ; + £?', one has 

Halloo < HAyJoo + Halloo < m/rx + /if(o-). 

A similar analysis can be carried out for the relative error. In this case the 
inequality /ic(cr) < e is replaced by f K (c t) < e, for e = e||[A 0 ,..., A„_i]|| 00 . So 
that the function g(a) is modified by replacing e with e. 

Like at the end of Section [572] in the overall estimate of the error, the quanti¬ 
ties ||T||oo an d maxr s ^ |(t/ft,) r . jS | are bounded from above by a = maxj(—(Uo)j t j), 
and we may scale the block-vector U so that these quantities are bounded by 1. 

The overall procedure is summarized in Algorithm [TJ where the repeated 
squaring of the block-triangular block-Toeplitz matrices can be performed by 
using Algorithm |2j as explained in Remark |4j 


Algorithm 7: Exponential of a block-triangular block-Toeplitz matrix by 

using embedding into a circulant matrix 

Input : The block-vector U = (Uj) j = o,™-i, an integer K > n 

Output : The block-vector Y = ( Yi)i=o, n -i, that is an 

approximation of the first block-row of e^^ u > 

1 Set a = maxj(—({7o)j^), p = Ll°g 2 a \ + 1 an d U = U/2 P 

2 Set W = (Wi)j = o,ic-i with W* = Ui for i = 0,..., n — 1, Wj = 0 for 
i = n ,..., K — 1 

3 Apply Algorithm 0 to compute the first block-row V = (I^)i=o k-i of 
e C(W) 

4 Set Yi = Vi, for i = 0,..., n — 1. 

5 for r=l,...,pdo 

6 | compute T(Y) = T(Y)T(Y) 

7 end for 
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5.4 Taylor series method 

In this section we use the Taylor series method for computing the exponential 
of an essentially nonnegative matrix, where the block-triangular block-Toeplitz 
structure is exploited to perform fast matrix-vector multiplications. The com¬ 
putation of the exponential of an essentially nonnegative matrix have been an¬ 
alyzed in [20] and HZ|. 

Following [2Q] and m the Taylor series method is applied to compute e T ^ u \ 
since the matrix T(U) = T(U) + al is nonnegative and e r ^ u ' ) can be obtained 
by means of the equation = e~ a e r ^ u \ In this way, we avoid possible 

cancellations in the Taylor summation. 

Denote by S r (T(U)) the Taylor series truncated at the rth term, namely 


Sr(T(U)) = £ 

k=0 


T(U) k 

k\ 


The following bound on the approximation error is given in P 
Theorem 13 Let r be such that p(T(U)/(r + 1)) < 1. Then 


le nu)_ Sr{T 0 )) \<mi 


I- 


T(U)\ 


The scaling and squaring method is used to accelerate the convergence of 
the Taylor series, by using the property that 


e nu) = e -<* e nu) = 


(e _a/2P e r(&)/2P ) 


Indeed, if p is an estimate of p(T{U)), and if p = [log 2 p\ +1, then p(T(U)/2 p ) < 
1 and the truncated Taylor series expansion is used to approximate e ' 1 ~(U)/ 2 p _ 
Since T(U) is block-triangular block-Toeplitz, then p(T{U)) = p{Uo)- 

The Toeplitz structure is used in the computation of the Taylor expansion 
and in the squaring procedure. In fact, the computation of each term in the 
power series expansion consists in performing products between block-triangular 
block-Toeplitz matrices, that can be done by applying Algorithm [5] in view of 
Remark [4] similarly in the squaring procedure at the end of the algorithm. 

Concerning rounding errors, we observe that the Taylor polynomial is the 
sum of nonnegative terms. Therefore no cancellation error is encountered in 
this summation. The main source of rounding errors is the computation of 
the powers T{U) k for k = 2,..., which are computed by means of Algorithm 
[2] in view of Remark [3] relying on FFT. We omit the error analysis of this 
computation, which is standard. However, we recall that in view of Theorem 
[5j FFT is normwise backward stable but not component-wise stable. For this 
reason, for the truncation of the power series it is convenient to replace the 
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component-wise bound expressed by Theorem [13] by the norm-wise bound 


|| e r (^) — S' r (T(f7))||oo < 


T(UY L T(U) 

r! l r +1 



OO 


from which we obtain that the condition 


T(UY 

r! 



T(U) \ 1 

r+1 ) 


<e S r (T{U)) 


implies that \\e T ^'> - S r (T(U ))||oo < e||Sy(T(f7))||oo- 
The overall procedure is stated in Algorithm [8] 

It is worth pointing out that, if the computation of the powers of the tri¬ 
angular Toeplitz matrices is performed with the standard algorithm then the 
computation is component-wise stable as shown in |20|. 


OO 


Algorithm 8 : Exponential of a block-triangular block-Toeplitz matrix by 
using Taylor series expansion 

Input : The block-vector U = (Ui)i=o, n -i defining the first 

block-row of T(U), a tolerance e > 0, a maximum number 
of iterations K 

Output : The block-vector Y = (Yi ) i=0 n _i, that is an 

approximation of the first block-row of 

1 Set a = maxj(—(Uo)j,j) 

2 Set U — (f7i)i=o,n-i) U 0 = U 0 + al, Ui = Ui, i = 1,... ,n - 1 

3 Compute p an estimate of p(Uo), or set p = ||CAd||oo 

4 Compute p = [log 2 p\ + 1 and V = U/2 P 

s Set W = V and Y = (Y^n-i, Y 0 = I+ V 0 , Y f = V it i = 1,.. .,n-1. 
e for r = 2,..., K do 

7 Compute T(W) = T(V)T(W/r) and Y = Y + W 

8 if || WHoo < e||T||oo then 

9 | break 

10 end if 

11 end for 

12 Compute Y = e~ a ^ 2P Y 

13 for i = 1 ,... ,p do 

14 | compute T(Y) = T(Y)T(Y) 

is end for 


6 Numerical experiments 

The numerical experiments have been performed in Matlab. To compute the 
error obtained with the proposed algorithms we have first computed the expo¬ 
nential by using the vpa arithmetic of the Symbolic Toolbox with 40 digits and 
we have considered this approximation as the exact value. 
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n 

cw-abs 

cw-rel 

nw-abs 

nw-rel 

128 

8.4e-14 

5.8e-ll 

6.5e-12 

8. Oe-12 

256 

l.le-14 

8.9e-ll 

M- 

(D 

1 

I- 4 - 

to 

2.le-12 

512 

1.2e-14 

2.6e-09 

7.7e-13 

9.6e-13 

1024 

2.2e-14 

9.le-04 

1.9e-12 

2.4e-12 


Table 1: Errors generated by Algorithm [6j based on the e-circulant technique, 
with e = i • 10” 2 


Denote by A h = 0,... ,n — 1, the approximations of the blocks on the first 
row of and define the four errors 

cw-abs = max|(A ft )ij - (A h )i,j |, 

h,i,j 

cw-rel = ma x{\(A h ) itj - (A h )ij\/\(A h ) itj \}, 

h,i,j 

nw-abs = ||[A 0 - A 0 , ... , A n _ t - A^JHoo, 

nw-rel — || [Ao A@ ,..., A n _r A n _i] ||oo/1| [Ao , • • •, A n _i] 11oo? 

which represent absolute/relative component-wise and norm-wise errors, respec¬ 
tively. 

We compare the accuracy and the execution times of the proposed algo¬ 
rithms. 

The test matrix T(U) is taken from two real world problems concerning 
the Erlangian approximation of a Markovian fluid queue m- The block-size 
n of T{U) is usually very large since a bigger n leads to a better Erlangian 
approximation, while the size m of the blocks is equal to 2 for both problems. 

We show the performances in terms of accuracy of the algorithm based on 
the e-circulant matrix. In Table [l] we report the errors generated by Algorithm [6] 
with e = i • 10 -2 applied to the first problem. Observe that the errors are much 
smaller in magnitude than |e|. The component-wise and norm-wise absolute 
errors range around 10~ 14 — 10~ 12 , while the componentwise relative errors 
deteriorate as n increases; the norm-wise relative errors moderately increase as 
n increases. This behavior is expected since the use of FFT makes the algorithm 
stable in norm, while the component-wise accuracy is not guaranteed. 

In Figure [l]we report the absolute/relative component-wise and the relative 
norm-wise errors as a function of e = i • d, with 6 varying from ICE 10 to 10°, 
in the case n = 512. In Figure [Ta| the scaling technique is not applied, while 
in Figure [Tb] the scaling is applied, as described in Algorithm [6j It is worth 
pointing out how the scaling allows to obtain a better accuracy, and the best 
performances are obtained with a larger value of |e|. Observe also that with 
the scaling technique the component-wise relative error takes values close to 
ICE 9 while the theoretical bound is asymptotically (2/3)/x « l.e — 10. Another 
interesting remark is that the absolute component-wise errors and the relative 
norm-wise errors reach a minimum value for a moderately large value of 0, and 
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Figure 1: Error as function of e = id for the e-circulant algorithm, with n = 512 



Figure 2: Error as function of 9 for the e-circulant algorithm, with k interpola¬ 
tion points and n = 512 


substantially increase for values smaller than this minimum. This is due to the 
effect of round-off errors, which increase as |e| goes to zero. 

In Figure [2] we report the normwise relative errors obtained with the e- 
circulant technique, described in Algorithm [5j applied with k different values 
6j = (i ) 1 / k uj J k 9, for j = 0 ,..., k— 1, where the solution is the arithmetic mean of 
e C ei ((7) ^ i n t ere sting to observe that using k — 2 leads to an approximation 
error better than k = 1, while for k = 4 the solution provided by the algorithm 
has an error close to the machine precision. Actually from this picture it is 
possible to figure out where the approximation errors and the roundoff errors 
dominate. For k = 4 the graph of the overall error is almost decreasing, this 
shows that the approximation error is removed by the technique of averaging the 
approximations obtained with different values of ej. From this behaviour one 
deduces that the approximation error numerically behaves like a polynomial 
of degree less than 8. This guess should be worth being investigated from a 
theoretical point of view. 

Now consider the method based on the embedding into a circulant matrix. 
In Table [2] we report the errors generated by Algorithm [ 7 ] with K = An applied 
to the first problem. The component-wise absolute errors are of the order of 
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n 

cw-abs 

cw-rel 

nw-abs 

nw-rel 

128 

2.Oe-16 

1 .4e-12 

8.9e-15 

l.le-14 

256 

4.Oe-16 

2.4e-ll 

2.2e-14 

2.8e-14 

512 

8.5e-16 

2.5e-09 

4.3e-14 

5.4e-14 

1024 

6.3e-16 

3.4e-04 

8.4e-14 

1.0e-13 


Table 2: Errors generated by Algorithm [7J based on the embedding technique, 
with K = 4 n 




Figure 3: Error as function of K for the embedding algorithm, with n = 512 


the machine precision, while the component-wise relative errors deteriorate as 
n increases; the norm-wise relative errors remain quite small as n increases. 
As for the e-circulant method, this behavior is expected for the use of FFT. 
The accuracy of this algorithm is better than that obtained with the e-circulant 
method. 

In Figure [3] we report the absolute/relative component-wise and relative 
norm-wise errors as a function of AT, in the case n = 512. In Figure [3a] the 
scaling technique is not applied, while in Figure [3b] the scaling is applied, as 
described in Algorithm [7} Also in this case it is worth pointing out how the 
scaling allows to obtain a better accuracy and optimal performances with smaller 
value of the block-size K, that is 4 n vs. 8 n. 

In Table [3] we report the errors generated by Algorithm [8] based on Taylor 
expansion. The errors have the same magnitude as those of Table [2] for the 
method based on the embedding. 

In Table [4] we report the CPU time in seconds, as a function of n, needed by 
the algorithm based on e-circulant matrix (epc), on embedding into a circulant 


n 

cw-abs 

cw-rel 

nw-abs 

nw-rel 

128 

4.7e-16 

9.5e-13 

5.7e-15 

7. Oe-15 

256 

1.8e-15 

4.3e-12 

2.2e-14 

2.8e-14 

512 

8.9e-16 

1.3e-09 

3.Oe-14 

3.8e-14 

1024 

4.8e-15 

7.Te-04 

1.3e-13 

1.6e-13 


Table 3: Errors generated by Algorithm |8j based on Taylor expansion 
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Algorithm \ n 

256 

512 

1024 

2048 

4096 

epc 

0.2 

0.5 

1.5 

4.6 

16.0 

emb 

0.4 

0.9 

2.4 

6.8 

22.4 

taylor 

0.6 

1.4 

3.8 

11.5 

37.6 

expm 

0.9 

5.9 

327.7 

* 

* 


Table 4: CPU time as function of the block-size n 



Figure 4: Error as function of e = i 6 for the e-circulant algorithm, with n = 512 


matrix (emb), on Taylor series expansion (taylor) and by the expm function 
of Matlab. The symbol denotes an execution time greater than 100 sec¬ 
onds. The time needed by expm increases much faster than the time needed 
by the other methods. The method epc is the fastest, and the method based 
on embedding is slightly faster than the Taylor series method when n is large 
enough. 

Concerning the second problem, we report only the results in the case where 
scaling is applied. In fact, there is not much differences between the sclaed 
and the unsealed versions since this problem is already well scaled in its original 
formulation. In Figure[4]we report the errors for the method based on e-circulant 
matrices. It is interesting to note that the optimal value of |e| is close to 1 and 
that the component-wise relative error is minimized by values of |e| greater 
than 1. This fact, which apparently seems to be a contradiction, is explained as 
follows. Large values of e generate large errors in the lower triangular part, i.e., 
the lower triangula part of e rl ' U ^ — e c ‘has large norm. On the other hand 
we consider the first block-row of e Ce ^ to approximate the matrix exponential 
of T(U), therefore the errors are not influenced by a large error in the lower 
triangular part. 

In Figure [5] we report the errors for the algorithm based on embedding. It is 
relevant to observe that the errors are essentially minimized with an embedding 
of just double size. 

To conclude, the method based on e-circulant is the fastest one, but the ac¬ 
curacy of the results is lower than that provided by the embedding and Taylor 
series expansion. However, by applying the averaging technique we can dramat¬ 
ically improve the accuracy of the e-circulant algorithm. 

The computational time of all the structured algorithms is much lower than 
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Figure 5: Error as function of K for the embedding algorithm, with n = 512 


the cost of the general method implemented in the expm function of Matlab and 
allows to deal with matrices with huge size. 

The algorithms based on embedding, on e-circulant matrices are faster than 
the one based on Taylor series with FFT matrix arithmetic. Moreover they are 
better suited for a parallel implementation. 
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